Unsupervised Crosslingual Embeddings


This model uses a GAN to learn a mapping between the embedding spaces of two languages without supervision, as demonstrated in Word Translation Without Parallel Data [1].
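For orientation, the adversarial objective described in [1] can be written as below. This is the paper's formulation, not a description of crossling_emb.py, which may differ in details. The symbols are those of the paper: W is the linear mapping, x_i are the n source embeddings, y_i are the m target embeddings, and \theta_D are the discriminator parameters.

```latex
% Discriminator loss: tell mapped source vectors (source = 1) apart from target vectors (source = 0)
\mathcal{L}_D(\theta_D \mid W) =
  -\frac{1}{n} \sum_{i=1}^{n} \log P_{\theta_D}(\mathrm{source} = 1 \mid W x_i)
  -\frac{1}{m} \sum_{i=1}^{m} \log P_{\theta_D}(\mathrm{source} = 0 \mid y_i)

% Mapping loss: train W so that the discriminator is fooled
\mathcal{L}_W(W \mid \theta_D) =
  -\frac{1}{n} \sum_{i=1}^{n} \log P_{\theta_D}(\mathrm{source} = 0 \mid W x_i)
  -\frac{1}{m} \sum_{i=1}^{m} \log P_{\theta_D}(\mathrm{source} = 1 \mid y_i)
```

The two losses are minimized in alternation, so the discriminator and the mapping improve against each other.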



  • nlp_architect/data/fasttext_emb.py: Defines the Fasttext object for loading Fasttext embeddings [2]
  • nlp_architect/models/crossling_emb.py: Defines the GAN for learning crosslingual embeddings
  • examples/crosslingembs/train.py: Trains the model and writes the final crosslingual embeddings to the weight_dir directory
  • examples/crosslingembs/evaluate.py: Defines the graph for evaluating the quality of the crosslingual embeddings
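To illustrate what loading Fasttext embeddings involves, here is a minimal sketch that parses the Fasttext text (.vec) format: a header line with the vocabulary size and dimension, followed by one word and its vector per line. The load_vec helper and the sample data are hypothetical and written for illustration only; this is not the Fasttext class defined in fasttext_emb.py.

```python
import io


def load_vec(handle, max_words=None):
    """Parse Fasttext's text (.vec) format into a {word: vector} dict.

    The first line holds "<vocab_size> <dim>"; each later line holds a word
    followed by <dim> space-separated floats.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        if max_words is not None and len(vectors) >= max_words:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed rows instead of failing
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return dim, vectors


# Tiny in-memory stand-in for a real embeddings file (assumed sample data).
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
dim, vectors = load_vec(sample)
```

With a real file, the same function could be called as load_vec(open(path, encoding="utf-8"), max_words=200000), where the cap keeps memory use bounded for large vocabularies.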


The main arguments that need to be passed to train.py are:

  • emb_dir: Directory where the Fasttext embeddings are located or will be downloaded
  • eval_dir: Directory where the evaluation dictionaries are downloaded
  • weight_dir: Directory where the mapping weights and the final crosslingual embeddings are written

Use the following command to run training and generate crosslingual embeddings file:

python train.py --emb_dir <embedding dir> --eval_dir <evaluation data> \
  --weight_dir <save_data> --epochs 1

Example Usage

Make directories for storing the downloaded embeddings and the multilingual evaluation dictionaries:

mkdir -p ./data/crosslingual/dictionaries

Run training, pointing it to the embedding directory and the multilingual evaluation dictionaries. After training, the mapping weights and the new crosslingual embeddings are stored in weight_dir:

python train.py --emb_dir ./data --eval_dir ./data/crosslingual/dictionaries --weight_dir ./


When trained on English and French embeddings, the word-to-word translation accuracy is as follows:

Eval Method    K=1     K=10
NN             53.0    74.13
CSLS           81.0    93.0
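The two evaluation methods differ only in how candidate translations are scored. The sketch below shows plain nearest-neighbor (NN) cosine scoring next to CSLS as defined in [1], followed by precision at K. It uses toy data and hypothetical function names; it is not the evaluation graph from evaluate.py, and the neighborhood size n_neighbors is an assumed parameter.

```python
import numpy as np


def normalize(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def nn_scores(mapped_src, tgt):
    """Cosine similarity between every mapped source word and every target word."""
    return normalize(mapped_src) @ normalize(tgt).T


def csls_scores(mapped_src, tgt, n_neighbors=10):
    """CSLS from [1]: 2*cos(Wx, y) - r_T(Wx) - r_S(y), which penalizes hub words."""
    sim = nn_scores(mapped_src, tgt)
    k_t = min(n_neighbors, sim.shape[1])
    k_s = min(n_neighbors, sim.shape[0])
    # r_T: mean similarity of each mapped source word to its nearest target words.
    r_t = np.sort(sim, axis=1)[:, -k_t:].mean(axis=1, keepdims=True)
    # r_S: mean similarity of each target word to its nearest mapped source words.
    r_s = np.sort(sim, axis=0)[-k_s:, :].mean(axis=0, keepdims=True)
    return 2 * sim - r_t - r_s


def precision_at_k(scores, gold, k):
    """Percentage of source words whose gold translation is among the top k."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = [gold[i] in top_k[i] for i in range(len(gold))]
    return 100.0 * float(np.mean(hits))


# Toy data (assumed): three target vectors and a known orthogonal mapping W.
tgt = np.eye(3)
W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
src = tgt @ W          # source vectors chosen so that W maps them onto tgt
mapped = src @ W.T     # apply the mapping to row vectors
gold = [0, 1, 2]       # source word i translates to target word i

nn_p1 = precision_at_k(nn_scores(mapped, tgt), gold, k=1)
csls_p1 = precision_at_k(csls_scores(mapped, tgt, n_neighbors=2), gold, k=1)
```

On real embeddings the two scores diverge because some target words are close to many source words; CSLS subtracts the neighborhood terms to discount such hubs.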


[1] Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. Word Translation Without Parallel Data. https://arxiv.org/pdf/1710.04087.pdf
[2] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov. Enriching Word Vectors with Subword Information. https://arxiv.org/abs/1607.04606